Noise Reduction and Audio-Visual Speech Activity Detection
FIELD AND BACKGROUND OF THE INVENTION 



The present invention generally relates to the field of noise reduction based on speech ac- 
tivity recognition, in particular to an audio-visual user interface of a telecommunication 
device running an application that can advantageously be used e.g. for a near-speaker de- 
tection algorithm in an environment where a speaker's voice is interfered by a statistically 
distributed background noise including environmental noise as well as surrounding per- 
sons' voices. 

Discontinuous transmission of speech signals based on speech/pause detection represents a 
valid solution to improve the spectral efficiency of new-generation wireless communication 
systems. In this context, robust voice activity detection algorithms are required, as conven- 
tional solutions according to the state of the art present a high misclassification rate in the 
presence of the background noise typical of mobile environments. 

A voice activity detector (VAD) aims to distinguish between a speech signal and several
types of acoustic background noise even with low signal-to-noise ratios (SNRs). Therefore,
in a typical telephone conversation, such a VAD, together with a comfort noise generator
(CNG), is used to achieve silence compression. In the field of multimedia communications,
silence compression allows a speech channel to be shared with other types of information,
thus guaranteeing simultaneous voice and data applications. In cellular radio systems which
are based on the Discontinuous Transmission (DTX) mode, such as GSM, VADs are applied
to reduce co-channel interference and power consumption of the portable equipment.
Furthermore, a VAD is vital to reduce the average data bit rate in future generations of digital
cellular networks such as the UMTS, which provide for a variable bit-rate (VBR) speech
coding. Most of the capacity gain is due to the distinction between speech activity and
inactivity. The performance of a speech coding approach which is based on phonetic
classification, however, strongly depends on the classifier, which must be robust to every type of
background noise. As is well known, the performance of a VAD is critical for the overall
speech quality, in particular with low SNRs. In case speech frames are detected as noise,
intelligibility is seriously impaired owing to speech clipping in the conversation. If, on the
other hand, the percentage of noise detected as speech is high, the potential advantages of
silence compression are not obtained. In the presence of background noise it may be
difficult to distinguish between speech and silence. Hence, for voice activity detection in
wireless environments more efficient algorithms are needed.


Although the Fuzzy Voice Activity Detector (FVAD) proposed in "Improved VAD G.729
Annex B for Mobile Communications Using Soft Computing" (Contribution ITU-T, Study
Group 16, Question 19/16, Washington, September 2-5, 1997) by F. Beritelli, S. Casale,
and A. Cavallaro performs better than other solutions presented in literature, it exhibits an
activity increase, above all in the presence of non-stationary noise. The functional scheme
of the FVAD is based on a traditional pattern recognition approach wherein the four
differential parameters used for speech activity/inactivity classification are the full-band energy
difference, the low-band energy difference, the zero-crossing difference, and the spectral
distortion. The matching phase is performed by a set of fuzzy rules obtained automatically
by means of a new hybrid learning tool as described in "FuGeNeSys: Fuzzy Genetic Neural
System for Fuzzy Modeling" by M. Russo (to appear in IEEE Transactions on Fuzzy
Systems). As is well known, a fuzzy system allows a gradual, continuous transition rather than
a sharp change between two values. Thus, the Fuzzy VAD returns a continuous output
signal ranging from 0 (non-activity) to 1 (activity), which does not depend on whether single
input signals have exceeded a predefined threshold or not, but on an overall evaluation of
the values they have assumed ("defuzzification process"). The final decision is made by
comparing the output of the fuzzy system, which varies in a range between 0 and 1, with a
fixed threshold experimentally chosen as described in "Voice Control of the Pan-European
Digital Mobile Radio System" (ICC '89, pp. 1070-1074) by C. B. Southcott et al.


Just like voice activity detectors, conventional automatic speech recognition (ASR) systems
also experience difficulties when being operated in noisy environments since the accuracy of
conventional ASR algorithms largely decreases in noisy environments. When a speaker is
talking in a noisy environment including both ambient noise as well as surrounding
persons' interfering voices, a microphone picks up not only the speaker's voice but also these
background sounds. Consequently, an audio signal which encompasses the speaker's voice
superimposed by said background sounds is processed. The louder the interfering sounds,
the more the acoustic comprehensibility of the speaker is reduced. To overcome this
problem, noise reduction circuitries are applied that make use of the different frequency regions
of environmental noise and the respective speaker's voice.

A typical noise reduction circuitry for a telephony-based application based on a speech
activity estimation algorithm according to the state of the art that implements a method for
correlating the discrete signal spectrum $S(k \cdot \Delta f)$ of an analog-to-digital-converted audio
signal s(t) with an audio speech activity estimate is shown in Fig. 2a. Said audio speech
activity estimate is obtained by an amplitude detection of the digital audio signal s(nT). The
circuit outputs a noise-reduced audio signal $\hat{s}_i(nT)$, which is calculated by subjecting the
difference of the discrete signal spectrum $S(k \cdot \Delta f)$ and a sampled version
$\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$ of a
statistically distributed background noise n(t) to an Inverse Fast Fourier Transform (IFFT).
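
The amplitude detection that supplies the audio speech activity estimate can be illustrated
with a minimal sketch. It is not taken from the cited state of the art: the function name, the
frame length, the percentile-based noise floor, and the threshold ratio are all assumptions
chosen for illustration only.

```python
import numpy as np

def amplitude_speech_activity(s, frame_len=160, threshold_ratio=2.0):
    """Return one 0/1 speech activity flag per frame of the sampled signal s(nT).

    Hypothetical helper: frame_len and threshold_ratio are illustrative values.
    """
    s = np.asarray(s, dtype=float)
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Amplitude detection: mean absolute amplitude of each frame.
    amplitude = np.abs(frames).mean(axis=1)
    # Assumption: the quietest 10 % of the frames approximate the noise floor.
    noise_floor = np.percentile(amplitude, 10)
    # A frame counts as speech when it is clearly louder than the noise floor.
    return (amplitude > threshold_ratio * noise_floor).astype(int)
```

Such a purely amplitude-based estimate cannot tell the speaker's voice from loud interfering
voices, which is the weakness the audio-visual approach described below addresses.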

BRIEF DESCRIPTION OF THE STATE OF THE ART 

The invention described in US 5,313,522 refers to a device for facilitating comprehension
by a hearing-impaired person participating in a telephone conversation, which comprises a
circuitry for converting received audio speech signals into a series of phonemes and an
arrangement for coupling the circuitry to a POTS line. The circuit thereby includes an
arrangement which correlates the detected series of phonemes with recorded lip movements
of a speaker and displays these lip movements in subsequent images on a display device,
thereby permitting the hearing-impaired person to carry out a lipreading procedure while
listening to the telephone conversation, which improves the person's level of
comprehension.

The invention disclosed in WO 99/52097 pertains to a communication device and a method
for sensing the movements of a speaker's lips, generating an audio signal corresponding to
detected lip movements of said speaker and transmitting said audio signal, thereby sensing
a level of ambient noise and accordingly controlling the power level of the audio signal to
be transmitted.

OBJECT OF THE UNDERLYING INVENTION 

In view of the state of the art mentioned above, it is the object of the present invention to 
enhance the speech/pause detection accuracy of a telephony-based voice activity detection 
(VAD) system. In particular, it is the object of the invention to increase the signal-to-inter- 
ference ratio (SIR) of a recorded speech signal in crowded environments where a speaker's 
voice is severely interfered by ambient noise and/or surrounding persons' voices. 

The aforementioned object is achieved by means of the features in the independent claims. 
Advantageous features are defined in the subordinate claims. 

SUMMARY OF THE INVENTION 

The present invention is dedicated to a noise reduction and automatic speech activity
recognition system having an audio-visual user interface, wherein said system is adapted for
running an application for combining a visual feature vector $o_{v,nT}$ that comprises features
extracted from a digital video sequence v(nT) showing a speaker's face by detecting and
analyzing e.g. lip movements and/or facial expressions of said speaker $S_i$ with an audio
feature vector $o_{a,nT}$ which comprises features extracted from a recorded analog audio
sequence s(t). Said audio sequence s(t) thereby represents the voice of said speaker $S_i$
interfered by a statistically distributed background noise

$$ n'(t) = n(t) + s_{Int}(t), \qquad (1) $$

which includes both environmental noise n(t) and a weighted sum of surrounding persons'
interfering voices

$$ s_{Int}(t) = \sum_{j=1}^{N} a_j \cdot s_j(t - T_j) \quad (\text{for } j \neq i) \qquad (2a) $$

in the environment of said speaker $S_i$. Thereby, N denotes the total number of speakers
(inclusive of said speaker $S_i$), $a_j$ is the attenuation factor for the interference signal $s_j(t)$ of the
j-th speaker $S_j$ in the environment of the speaker $S_i$, $T_j$ is the delay of $s_j(t)$, and $R_{jM}$ denotes
the distance between the j-th speaker $S_j$ and a microphone recording the audio signal s(t).
By tracking the lip movement of a speaker, visual features are extracted which can then be
analyzed and used for further processing. For this reason, the bimodal perceptual user
interface comprises a video camera pointing to the speaker's face for recording a digital video
sequence v(nT) showing lip movements and/or facial expressions of said speaker $S_i$, audio
feature extraction and analyzing means for determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on the recorded audio sequence
s(t), and visual feature extraction and analyzing means for continuously or intermittently
determining the current location of the speaker's face, tracking lip movements and/or facial
expressions of the speaker in subsequent images and determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on the detected lip
movements and/or facial expressions.
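
The background noise model, i.e. environmental noise plus a weighted sum of delayed
interfering voices, can be simulated with the following minimal sketch. The function name
is hypothetical, the delays $T_j$ are assumed to be whole numbers of samples, and the
attenuation factors $a_j$ are passed in directly rather than derived from the distances $R_{jM}$.

```python
import numpy as np

def mix_background(n, interferers, attenuations, delays):
    """Return n'(t) = n(t) + sum_j a_j * s_j(t - T_j) for sampled signals.

    n            : environmental noise n(t)
    interferers  : list of interfering voice signals s_j(t), j != i
    attenuations : list of attenuation factors a_j
    delays       : list of non-negative delays T_j, in samples (assumption)
    """
    n_prime = np.asarray(n, dtype=float).copy()
    for s_j, a_j, d_j in zip(interferers, attenuations, delays):
        delayed = np.zeros_like(n_prime)
        # Shift s_j by d_j samples; everything before the delay stays zero.
        src = np.asarray(s_j, dtype=float)[:max(len(n_prime) - d_j, 0)]
        delayed[d_j:d_j + len(src)] = src
        n_prime += a_j * delayed
    return n_prime
```

The microphone signal s(t) would then be the voice of the speaker $S_i$ plus the returned n'(t).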

According to the invention, the aforementioned extracted and analyzed visual features are
fed to a noise reduction circuit that is needed to increase the signal-to-interference ratio
(SIR) of the recorded audio signal s(t). Said noise reduction circuit is specially adapted to
perform a near-speaker detection by separating the speaker's voice from said background
noise n'(t) based on the derived acoustic-phonetic speech characteristics

$$ o_{av,nT} := [\, o_{a,nT}^{\top},\; o_{v,nT}^{\top} \,]^{\top} \qquad (3) $$

It outputs a speech activity indication signal ($\hat{s}_i(nT)$) which is obtained by a combination
of speech activity estimates supplied by said audio feature extraction and analyzing means
as well as said visual feature extraction and analyzing means.



BRIEF DESCRIPTION OF THE DRAWINGS



Advantageous features, aspects, and useful embodiments of the invention will become evi- 
dent from the following description, the appended claims, and the accompanying drawings. 
Thereby, 


Fig. 1 shows a noise reduction and speech activity recognition system having an
audio-visual user interface, said system being specially adapted for running a real-time
lip tracking application which combines visual features $o_{v,nT}$ extracted from a
digital video sequence v(nT) showing the face of a speaker $S_i$ by detecting and
analyzing the speaker's lip movements and/or facial expressions with audio
features $o_{a,nT}$ extracted from an analog audio sequence s(t) representing the voice of
said speaker $S_i$ interfered by a statistically distributed background noise n'(t),



Fig. 2a is a block diagram showing a conventional noise reduction and speech activity 
recognition system for a telephony-based application based on an audio speech 
activity estimation according to the state of the art, 

Fig. 2b shows an example of a camera-enhanced noise reduction and speech activity
recognition system for a telephony-based application that implements an audio- 
visual speech activity estimation algorithm according to one embodiment of the 
present invention, 

Fig. 2c shows an example of a camera-enhanced noise reduction and speech activity
recognition system for a telephony-based application that implements an audio- 
visual speech activity estimation algorithm according to a further embodiment of 
the present invention, 

Fig. 3a shows a flow chart illustrating a near-end speaker detection method reducing the
noise level of a detected analog audio sequence s(t) according to the embodiment 
depicted in Fig. 1 of the present invention, 

Fig. 3b shows a flow chart illustrating a near-end speaker detection method according to 
the embodiment depicted in Fig. 2b of the present invention, and 

Fig. 3c shows a flow chart illustrating a near-end speaker detection method according to 
the embodiment depicted in Fig. 2c of the present invention. 

DETAILED DESCRIPTION OF THE UNDERLYING INVENTION 

In the following, different embodiments of the present invention as depicted in Figs. 1, 2b, 
2c, and 3a-c shall be explained in detail. The meaning of the symbols designated with ref- 
erence numerals and signs in Figs. 1 to 3c can be taken from an annexed table. 

According to a first embodiment of the invention as depicted in Fig. 1, said noise reduction
and speech activity recognition system 100 comprises a noise reduction circuit 106 which
is specially adapted to reduce the background noise n'(t) received by a microphone 101a
and to perform a near-speaker detection by separating the speaker's voice from said
background noise n'(t) as well as a multi-channel acoustic echo cancellation unit 108 being
specially adapted to perform a near-end speaker detection and/or double-talk detection
algorithm based on acoustic-phonetic speech characteristics derived with the aid of the
aforementioned audio and visual feature extraction and analyzing means 106b and 104a+b,
respectively. Thereby, said acoustic-phonetic speech characteristics are based on the opening
of a speaker's mouth as an estimate of the acoustic energy of articulated vowels or
diphthongs, respectively, rapid movement of the speaker's lips as a hint to labial or labio-dental
consonants (e.g. plosive, fricative or affricative phonemes - voiced or unvoiced,
respectively), and other statistically detected phonetic characteristics of an association between
position and movement of the lips and the voice and pronunciation of a speaker $S_i$.

The aforementioned noise reduction circuit 106 comprises digital signal processing means
106a for calculating a discrete signal spectrum $S(k \cdot \Delta f)$ that corresponds to an
analog-to-digital-converted version s(nT) of the recorded audio sequence s(t) by performing a Fast
Fourier Transform (FFT), audio feature extraction and analyzing means 106b (e.g. an
amplitude detector) for detecting acoustic-phonetic speech characteristics of a speaker's voice
and pronunciation based on the recorded audio sequence s(t), means 106c for estimating
the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
n'(t) based on the result of the speaker detection procedure performed by said audio feature
extraction and analyzing means 106b, a subtracting element 106d for subtracting a
discretized version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum
$\hat{\Phi}_{nn}(f)$ from the discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio sequence s(nT), and digital signal processing means 106e for
calculating the corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the obtained
difference signal by performing an Inverse Fast Fourier Transform (IFFT).

The depicted noise reduction and speech activity recognition system 100 comprises audio
feature extraction and analyzing means 106b which are used for determining
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation ($o_{a,nT}$) based on the
recorded audio sequence s(t) and visual feature extraction and analyzing means 104a+b for
determining the current location of the speaker's face at a data rate of 1 frame/s, tracking
lip movements and/or facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and
determining acoustic-phonetic speech characteristics of the speaker's voice and
pronunciation based on detected lip movements and/or facial expressions ($o_{v,nT}$).
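
The two data rates mentioned above (face localization at 1 frame/s, lip tracking at
15 frames/s) can be pictured as a simple scheduling loop. This is only a sketch: the names
process_video, locate_face, and track_lips are hypothetical, and the detectors themselves
are passed in as callables because the source does not specify them.

```python
def process_video(frames, locate_face, track_lips, track_fps=15, locate_fps=1):
    """Run lip tracking on every frame and face localization once per second.

    frames      : iterable of video frames v(nT), assumed to arrive at track_fps
    locate_face : hypothetical callable, frame -> face region
    track_lips  : hypothetical callable, (frame, face region) -> visual features
    """
    # With the rates from the text, the face is re-located every 15th frame.
    locate_every = max(1, track_fps // locate_fps)
    face_region = None
    features = []
    for k, frame in enumerate(frames):
        if k % locate_every == 0:
            face_region = locate_face(frame)
        # Lip tracking reuses the most recent face region between updates.
        features.append(track_lips(frame, face_region))
    return features
```

Reusing the last known face region keeps the costly localization step off the per-frame path,
which is one plausible reason for the lower localization rate.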

As depicted in Fig. 1, said noise reduction system 200b/c can advantageously be used for a
video-telephony based application in a telecommunication system running on a
video-enabled phone 102 which is equipped with a built-in video camera 101b' pointing at the face
of a speaker $S_i$ participating in a video telephony session.

Fig. 2b shows an example of a slow camera-enhanced noise reduction and speech activity
recognition system 200b for a telephony-based application which implements an
audio-visual speech activity estimation algorithm according to one embodiment of the present
invention. Thereby, an audio speech activity estimate taken from an audio feature vector
supplied by said audio feature extraction and analyzing means 106b is correlated with a
further speech activity estimate that is obtained by calculating the difference of the discrete
signal spectrum $S(k \cdot \Delta f)$ and a sampled version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power
density spectrum $\hat{\Phi}_{nn}(f)$ of the statistically distributed background noise n'(t). Said audio
speech activity estimate is obtained by an amplitude detection of the band-pass-filtered
discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal s(t).

Similar to the embodiment depicted in Fig. 1, the noise reduction and speech activity
recognition system 200b depicted in Fig. 2b comprises an audio feature extraction and
analyzing means 106b (e.g. an amplitude detector) which is used for determining
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation ($o_{a,nT}$) based on the
recorded audio sequence s(t) and visual feature extraction and analyzing means 104' and
104" for determining the current location of the speaker's face at a data rate of 1 frame/s,
tracking lip movements and facial expressions of said speaker $S_i$ at a data rate of 15
frames/s and determining acoustic-phonetic speech characteristics of the speaker's voice
and pronunciation based on detected lip movements and/or facial expressions ($o_{v,nT}$).
Thereby, said audio feature extraction and analyzing means 106b can simply be realized as
an amplitude detector.

Aside from the components 106a-e described above with reference to Fig. 1, the noise
reduction circuit 106 depicted in Fig. 2b comprises a delay element 204, a first multiplier
element 107a, and a second multiplier element 107. The delay element 204 provides a
delayed version of the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio signal s(t). The first multiplier element 107a is used for correlating (S9) the discrete
signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT - \tau)$ of the analog-to-digital-converted
audio signal s(nT) with a visual speech activity estimate taken from a visual feature
vector $o_{v,t}$ supplied by the visual feature extraction and analyzing means 104a+b and/or
104'+104", thus yielding a further estimate $\hat{S}_i'(f)$ for updating the estimate $\hat{S}_i(f)$ for
the frequency spectrum $S_i(f)$ corresponding to the signal $s_i(t)$ that represents said speaker's
voice as well as a further estimate $\hat{\Phi}_{nn}'(f)$ for updating the estimate $\hat{\Phi}_{nn}(f)$ for the noise
power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise n'(t). The
second multiplier element 107 is used for correlating (S8a) the discrete signal
spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT - \tau)$ of the analog-to-digital-converted audio
signal s(nT) with an audio speech activity estimate obtained by an amplitude detection (S8b)
of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$, thus yielding an estimate $\hat{S}_i(f)$
for the frequency spectrum $S_i(f)$ which corresponds to the signal $s_i(t)$ that represents said
speaker's voice and an estimate $\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of
said background noise n'(t). A sample-and-hold (S&H) element 106d' provides a sampled
version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$. The noise
reduction circuit 106 further comprises a band-pass filter with adjustable cut-off
frequencies, which is used for filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal s(t). The cut-off frequencies can be adjusted dependent on the
bandwidth of the estimated speech signal spectrum $\hat{S}_i(f)$. A switch 106f is provided for
selectively switching between a first and a second mode for receiving said speech signal $s_i(t)$
with and without using the proposed audio-visual speech recognition approach providing a
noise-reduced speech signal $\hat{s}_i(t)$, respectively. According to a further aspect of the
present invention, means are provided for switching said microphone 101a off when the actual
level of the speech activity indication signal $\hat{s}_i(nT)$ falls below a predefined threshold
value (not shown).
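
How the cut-off frequencies might follow the bandwidth of the estimated speech spectrum
can be sketched as follows. The source does not state the adaptation rule, so the rule used
here (keep every bin whose power reaches a fraction rel_level of the peak power) is purely
an assumption, as are the function names and the one-sided spectrum layout.

```python
import numpy as np

def adapt_cutoffs(speech_spectrum_est, delta_f, rel_level=0.1):
    """Derive lower/upper cut-off frequencies from the estimated speech spectrum.

    Assumption: the speech band is the span of bins whose power is at least
    rel_level times the peak power of the estimate.
    """
    power = np.abs(np.asarray(speech_spectrum_est)) ** 2
    active = np.flatnonzero(power >= rel_level * power.max())
    return active[0] * delta_f, active[-1] * delta_f

def band_pass(spectrum, delta_f, f_low, f_high):
    """Zero all bins of a one-sided spectrum S(k * delta_f) outside [f_low, f_high]."""
    spectrum = np.asarray(spectrum)
    freqs = np.arange(len(spectrum)) * delta_f
    mask = (freqs >= f_low) & (freqs <= f_high)
    return np.where(mask, spectrum, 0)
```

In a running system the cut-offs would be recomputed whenever the estimate of the speech
spectrum is updated, so the filter tracks the speaker rather than a fixed telephone band.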

An example of a fast camera-enhanced noise reduction and speech activity recognition
system 200c for a telephony-based application which implements an audio-visual speech
activity estimation algorithm according to a further embodiment of the present invention is
depicted in Fig. 2c. The circuitry correlates a discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal s(t) with a delayed version of an audio-visual speech
activity estimate and a further speech activity estimate obtained by calculating the difference
spectrum of the discrete signal spectrum $S(k \cdot \Delta f)$ and a sampled version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the
estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$. The aforementioned audio-visual speech
activity estimate is taken from an audio-visual feature vector $o_{av,t}$ obtained by combining an
audio feature vector supplied by said audio feature extraction and analyzing means
106b with a visual feature vector $o_{v,t}$ supplied by said visual speech activity detection
module 104".

Aside from the components described above with reference to Fig. 1, the noise reduction
circuit 106 depicted in Fig. 2c comprises a summation element 107c, which is used for
adding (S11a) an audio speech activity estimate to a visual speech activity estimate, thus
yielding an audio-visual speech activity estimate. The audio speech activity estimate is
supplied from an audio feature extraction and analyzing means 106b (e.g. an amplitude
detector) for determining acoustic-phonetic speech characteristics of the speaker's voice and
pronunciation ($o_{a,nT}$) based on the recorded audio sequence s(t). The visual speech activity
estimate is supplied from visual feature extraction and analyzing means 104' and 104" for
determining the current location of the speaker's face at a data rate of 1 frame/s, tracking
lip movements and facial expressions of said speaker $S_i$ at a data rate of 15 frames/s and
determining acoustic-phonetic speech characteristics of the speaker's voice and pronunciation
based on detected lip movements and/or facial expressions ($o_{v,nT}$). The
noise reduction circuit 106 further comprises a multiplier element 107', which is used for
correlating (S11b) the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio signal s(t) with an audio-visual speech activity estimate, obtained by combining an
audio feature vector supplied by said audio feature extraction and analyzing means
106b with a visual feature vector $o_{v,t}$ supplied by said visual speech activity detection
module 104", thereby yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ which
corresponds to the signal $s_i(t)$ that represents the speaker's voice and an estimate $\hat{\Phi}_{nn}(f)$ for the
noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
n'(t). A sample-and-hold (S&H) element 106d' provides a sampled version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of
the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$. The noise reduction circuit 106
further comprises a band-pass filter with adjustable cut-off frequencies, which is used for
filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal
s(t). Said cut-off frequencies can be adjusted dependent on the bandwidth of the estimated
speech signal spectrum $\hat{S}_i(f)$. A switch 106f is provided for selectively switching
between a first and a second mode for receiving said speech signal $s_i(t)$ with and without
using the proposed audio-visual speech recognition approach providing a noise-reduced
speech signal $\hat{s}_i(t)$, respectively. According to a further aspect of the present invention,
said noise reduction system 200c comprises means (SW) for switching said microphone
101a off when the actual level of the speech activity indication signal $\hat{s}_i(nT)$ falls below a
predefined threshold value (not shown).
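
The summation (S11a) and multiplication (S11b) steps of this embodiment can be sketched
as below. The source states only that the two activity estimates are added and that the
spectrum is correlated with the result; the scaling of both estimates to [0, 1], the clipping of
their sum, and the use of the complementary weight for the noise estimate are assumptions
made for this illustration, and the function name is hypothetical.

```python
import numpy as np

def fuse_and_weight(spectrum, audio_activity, visual_activity):
    """Combine audio and visual activity estimates and weight the spectrum.

    spectrum        : discrete signal spectrum S(k * delta_f) of one frame
    audio_activity  : audio speech activity estimate, assumed in [0, 1]
    visual_activity : visual speech activity estimate, assumed in [0, 1]
    """
    spectrum = np.asarray(spectrum)
    # S11a: add both estimates (clipping to [0, 1] is an assumption).
    av_activity = np.clip(audio_activity + visual_activity, 0.0, 1.0)
    # S11b: weight the spectrum with the audio-visual activity estimate.
    speech_spectrum_est = av_activity * spectrum
    # Assumption: the remaining share of the frame power is attributed to noise.
    noise_psd_est = (1.0 - av_activity) * np.abs(spectrum) ** 2
    return av_activity, speech_spectrum_est, noise_psd_est
```

Because both modalities feed a single multiplier, this variant needs no delay element for the
visual path, which may be what the description means by a "fast" system.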

A still further embodiment of the present invention is directed to a near-end speaker
detection method as shown in the flow chart depicted in Fig. 3a. Said method reduces the noise
level of a recorded analog audio sequence s(t) being interfered by a statistically distributed
background noise n'(t), said audio sequence representing the voice of a speaker $S_i$. After
having subjected (S1) the analog audio sequence s(t) to an analog-to-digital conversion, the
corresponding discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
sequence s(nT) is calculated (S2) by performing a Fast Fourier Transform (FFT) and the
voice of said speaker $S_i$ is detected (S3) from said signal spectrum $S(k \cdot \Delta f)$ by analyzing
visual features extracted from a video sequence v(nT) that is recorded simultaneously with
the analog audio sequence s(t) and tracks the current location of the speaker's
face, lip movements and/or facial expressions of the speaker $S_i$ in subsequent images. Next,
the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
n'(t) is estimated (S4) based on the result of the speaker detection step (S3), whereupon a
sampled version $\hat{\Phi}_{nn}(k \cdot \Delta f)$ of the estimated noise power density spectrum $\hat{\Phi}_{nn}(f)$ is
subtracted (S5) from the discrete spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio
sequence s(nT). Finally, the corresponding discrete time-domain signal $\hat{s}_i(nT)$ of the
obtained difference signal, which represents a discrete version of the recognized speech
signal, is calculated (S6) by performing an Inverse Fast Fourier Transform (IFFT).
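
Steps S2 to S6 can be put together in a minimal sketch. Several simplifications are
assumptions and not part of the described method: frames are non-overlapping and
unwindowed, the visual speaker detection (S3) is reduced to one boolean flag per frame
supplied by the caller, and the subtraction (S5) is carried out as a power spectral subtraction
that keeps the phase of the noisy spectrum.

```python
import numpy as np

def reduce_noise(s_nT, speech_active, frame_len=256):
    """Noise reduction following steps S2-S6 on the sampled sequence s(nT).

    s_nT          : analog-to-digital-converted audio sequence (result of S1)
    speech_active : one boolean per frame, assumed to come from the visual
                    speaker detection (S3)
    """
    s_nT = np.asarray(s_nT, dtype=float)
    n_frames = len(s_nT) // frame_len
    frames = s_nT[:n_frames * frame_len].reshape(n_frames, frame_len)
    # S2: discrete signal spectrum of every frame via FFT.
    spectra = np.fft.rfft(frames, axis=1)
    # S3: speaker detection result, here simply taken from the caller.
    active = np.asarray(speech_active[:n_frames], dtype=bool)
    # S4: estimate the noise power density from the frames without speech.
    if (~active).any():
        noise_psd = (np.abs(spectra[~active]) ** 2).mean(axis=0)
    else:
        noise_psd = np.zeros(spectra.shape[1])
    # S5: subtract the noise estimate, floor at zero, keep the noisy phase.
    clean_power = np.maximum(np.abs(spectra) ** 2 - noise_psd, 0.0)
    clean_spectra = np.sqrt(clean_power) * np.exp(1j * np.angle(spectra))
    # S6: back to the time domain via inverse FFT.
    return np.fft.irfft(clean_spectra, n=frame_len, axis=1).reshape(-1)
```

A practical implementation would normally add windowing with overlap-add to avoid frame
boundary artifacts; that is omitted here to keep the step mapping visible.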

Optionally, a multi-channel acoustic echo cancellation algorithm which models echo path
impulse responses by means of adaptive finite impulse response (FIR) filters and subtracts
echo signals from the analog audio sequence s(t) can be conducted (S7) based on
acoustic-phonetic speech characteristics derived by an algorithm for extracting visual features from
a video sequence tracking the location of a speaker's face, lip movements and/or facial
expressions of the speaker $S_i$ in subsequent images. Said multi-channel acoustic echo
cancellation algorithm thereby performs a double-talk detection procedure.
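
The interplay of an adaptive FIR echo model and double-talk detection can be illustrated
with a single-channel sketch. The source names neither the adaptation rule nor the
double-talk logic; the normalized LMS (NLMS) update used here is a common substitute chosen
for illustration, the multi-channel case is reduced to one channel, and the near-end activity
flags are assumed to come from the visual feature analysis.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, near_end_active, n_taps=8, mu=0.5, eps=1e-8):
    """Single-channel NLMS echo canceller that freezes adaptation during double talk.

    far_end         : loudspeaker (far-end) signal
    mic             : microphone signal containing the echo
    near_end_active : per-sample flags, True while the near-end speaker talks
                      (assumed to be derived from lip tracking)
    """
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    w = np.zeros(n_taps)       # adaptive FIR model of the echo path
    x_buf = np.zeros(n_taps)   # most recent far-end samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        # Subtract the modeled echo from the microphone signal.
        e = mic[n] - w @ x_buf
        out[n] = e
        # Adapt only when the near-end speaker is silent (no double talk).
        if not near_end_active[n]:
            w += mu * e * x_buf / (x_buf @ x_buf + eps)
    return out, w
```

Freezing the update while the lips indicate near-end speech keeps the speaker's own voice
from corrupting the echo path model, which is the purpose of double-talk detection.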

According to a further aspect of the invention, a learning procedure is applied which
enhances the step of detecting (S3) the voice of said speaker $S_i$ from the discrete signal
spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted version s(nT) of an analog audio sequence
s(t) by analyzing visual features extracted from a video sequence that is recorded
simultaneously with the analog audio sequence s(t) and tracks the current location of the
speaker's face, lip movements and/or facial expressions of the speaker $S_i$ in subsequent
images.

In one embodiment of the present invention, which is illustrated in the flow charts depicted
in Figs. 3a+b, a near-end speaker detection method is proposed that is characterized by the
step of correlating (S8a) the discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT - \tau)$
of the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate
obtained by an amplitude detection (S8b) of the band-pass-filtered discrete signal spectrum
$S(k \cdot \Delta f)$, thereby yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ which
corresponds to the signal $s_i(t)$ representing said speaker's voice and an estimate $\hat{\Phi}_{nn}(f)$ for the
noise power density spectrum $\Phi_{nn}(f)$ of said background noise n'(t). Moreover, the
discrete signal spectrum $S_\tau(k \cdot \Delta f)$ of a delayed version $s(nT - \tau)$ of the
analog-to-digital-converted audio signal s(nT) is correlated (S9) with a visual speech activity estimate taken
from a visual feature vector $o_{v,t}$ which is supplied by the visual feature extraction and
analyzing means 104a+b and/or 104'+104", thus yielding a further estimate $\hat{S}_i'(f)$ for
updating the estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$ which corresponds to the signal
$s_i(t)$ representing the speaker's voice as well as a further estimate $\hat{\Phi}_{nn}'(f)$ that is used for
updating the estimate $\hat{\Phi}_{nn}(f)$ for the noise power density spectrum $\Phi_{nn}(f)$ of the
statistically distributed background noise n'(t). The noise reduction circuit 106 thereby provides a
band-pass filter 204 for filtering the discrete signal spectrum $S(k \cdot \Delta f)$ of the
analog-to-digital-converted audio signal s(t), wherein the cut-off frequencies of said band-pass filter
204 are adjusted (S10) dependent on the bandwidth of the estimated speech signal
spectrum $\hat{S}_i(f)$.

In a further embodiment of the present invention as shown in the flow charts depicted in
Figs. 3a+c, a near-end speaker detection method is proposed which is characterized by the
step of adding (S11a) an audio speech activity estimate obtained by an amplitude detection
of the band-pass-filtered discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted
audio signal s(t) to a visual speech activity estimate taken from a visual feature vector $o_{v,t}$
supplied by said visual feature extraction and analyzing means 104a+b and/or 104'+104",
thereby yielding an audio-visual speech activity estimate. According to this embodiment,
the discrete signal spectrum $S(k \cdot \Delta f)$ is correlated (S11b) with the audio-visual speech
activity estimate, thus yielding an estimate $\hat{S}_i(f)$ for the frequency spectrum $S_i(f)$
corresponding to the signal $s_i(t)$ that represents said speaker's voice as well as an estimate $\hat{\Phi}_{nn}(f)$ for
the noise power density spectrum $\Phi_{nn}(f)$ of the statistically distributed background noise
n'(t). The cut-off frequencies of the band-pass filter 204 that is used for filtering the
discrete signal spectrum $S(k \cdot \Delta f)$ of the analog-to-digital-converted audio signal s(t) are
adjusted (S11c) dependent on the bandwidth of the estimated speech signal spectrum $\hat{S}_i(f)$.

Finally, the present invention also pertains to the use of a noise reduction system 200b/c
and a corresponding near-end speaker detection method as described above for a
video-telephony based application (e.g. a video conference) in a telecommunication system running
on a video-enabled phone having a built-in video camera 101b' pointing at the face of a
speaker $S_i$ participating in a video telephony session. This especially pertains to a scenario
where a number of persons are sitting in one room equipped with many cameras and
microphones such that a speaker's voice interferes with the voices of the other persons.
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Table; Depicted Featur es and Their Corresponding Reference Kif mc 



No.    Technical Feature (System Component or Procedure Step)

100

noise reduction and speech activity recognition system having an audio-visual user interface,
said system being specially adapted for running a real-time lip tracking application
which combines visual features o_v,nT extracted from a digital video sequence v(nT)
showing the face of a speaker S_i by detecting and analyzing the speaker's lip movements
and/or facial expressions with audio features o_a,nT extracted from an analog audio sequence
s(t) representing the voice of said speaker S_i interfered by a statistically distributed
background noise n'(t), wherein said audio sequence s(t) includes - aside from the
signal representing the voice of said speaker S_i - both environmental noise n(t) and a
weighted sum Σ_j a_j·s_j(t−T_j) (with j ≠ i) of surrounding persons' interfering voices in the
environment of said speaker S_i

101a

microphone, used for recording an analog audio sequence s(t) representing the voice of a
speaker S_i interfered by a statistically distributed background noise n'(t), which includes
both environmental noise n(t) and a weighted sum Σ_j a_j·s_j(t−T_j) (with j ≠ i) of surrounding
persons' interfering voices in the environment of said speaker S_i




104

visual front end of an automatic audio-visual speech recognition system 100 using a
bimodal approach to speech recognition and near-speaker detection by incorporating a
real-time lip tracking algorithm for deriving additional visual features from lip movements
and/or facial expressions of a speaker S_i whose voice is interfered by a statistically
distributed background noise n'(t), the visual front end 104 comprising visual feature
extraction and analyzing means for continuously or intermittently determining the
current location of the speaker's face, tracking lip movements and/or facial expressions
of the speaker S_i in subsequent images and determining acoustic-phonetic speech
characteristics of the speaker's voice and pronunciation based on detected lip movements
and/or facial expressions


104'

visual feature extraction module for continuously tracking lip movements and/or facial
expressions of the speaker S_i and determining acoustic-phonetic speech characteristics
of the speaker's voice based on detected lip movements and/or facial expressions

104"

visual speech activity detection module for analyzing the acoustic-phonetic speech
characteristics and detecting speech activity of a speaker based on said analysis


104a

visual feature extraction means for continuously or intermittently determining the current
location of the speaker's face recorded by a video camera 101b at a rate of 1 frame/s

104b

visual feature extraction and analyzing means for continuously tracking lip movements
and/or facial expressions of the speaker S_i and determining acoustic-phonetic speech
characteristics of said speaker's voice based on detected lip movements and/or facial
expressions at a rate of 15 frames/s


106

noise reduction circuit being specially adapted to reduce statistically distributed background
noise n'(t) received by said microphone 101a and perform a near-speaker detection
by separating the speaker's voice from said background noise n'(t) based on a combination
of the speech characteristics which are derived by said audio and visual feature
extraction and analyzing means 104a+b and 106b, respectively

106a

digital signal processing means for calculating the discrete signal spectrum S(k·Δf) that
corresponds to an analog-to-digital-converted version s(nT) of the recorded audio sequence
s(t) by performing a Fast Fourier Transform (FFT)

106b

audio feature extraction and analyzing means (e.g. an amplitude detector) for detecting
acoustic-phonetic speech characteristics of the speaker's voice and pronunciation based
on the recorded audio sequence s(t)


106c

means for estimating the noise power density spectrum Φ_nn(f) of the statistically distributed
background noise n'(t) based on the result of the speaker detection procedure
performed by said audio-visual feature extraction and analyzing means 104b, 106b, 104'
and/or 104"

106c'

means for estimating the signal spectrum S_i(f) of the recorded speech signal s_i(t) based
on the result of the speaker detection procedure performed by said audio-visual feature
extraction and analyzing means 104b, 106b, 104' and/or 104"

106d

subtracting element for subtracting a discretized version Φ̃_nn(k·Δf) of the estimated
noise power density spectrum Φ̃_nn(f) from the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT)

106d'

sample-and-hold (S&H) element providing a sampled version Φ̃_nn(k·Δf) of the estimated
noise power density spectrum Φ̃_nn(f)


106e

digital signal processing means for calculating the corresponding discrete time-domain
signal ŝ_i(nT) of the obtained difference signal by performing an Inverse Fast Fourier
Transform (IFFT)

106f

switch for selectively switching between a first and a second mode for receiving said
speech signal s_i(t) with and without using the proposed audio-visual speech recognition
approach providing a noise-reduced speech signal ŝ_i(t), respectively


107

multiplier element, used for correlating the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) with an audio speech activity estimate which is
obtained by an amplitude detection of the digital audio signal s(nT)

107'

multiplier element, used for correlating (S11b) the discrete signal spectrum S(k·Δf) of
the analog-to-digital-converted audio signal s(t) with an audio-visual speech activity
estimate, obtained by combining an audio feature vector supplied by said audio feature
extraction and analyzing means 106b with a visual feature vector o_v,t supplied by said
visual speech activity detection module 104", thereby yielding an estimate S̃_i(f) for
the frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said
speaker's voice and an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f)
of the statistically distributed background noise n'(t)



107b

multiplier element, used for correlating (S9) the discrete signal spectrum S_τ(k·Δf) of a
delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with a visual
speech activity estimate taken from a visual feature vector o_v,t supplied by the visual
feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a
further estimate S̃_i'(f) for updating the estimate S̃_i(f) for the frequency spectrum
S_i(f) corresponding to the signal s_i(t) which represents said speaker's voice as well as a
further estimate Φ̃_nn'(f) for updating the estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)

multiplier element, used for correlating (S8a) the discrete signal spectrum S_τ(k·Δf) of a
delayed version s(nT−τ) of the analog-to-digital-converted audio signal s(nT) with an
audio speech activity estimate obtained by an amplitude detection (S8b) of the
band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate S̃_i(f) for the
frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said speaker's
voice as well as an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of
the statistically distributed background noise n'(t)



108a

summation element, used for adding (S11a) the audio speech activity estimate to the
visual speech activity estimate, thereby yielding an audio-visual speech activity estimate

multi-channel acoustic echo cancellation unit being specially adapted to perform a
near-end speech detection and/or double-talk detection algorithm based on acoustic-phonetic
speech characteristics derived by said audio and visual feature extraction and analyzing
means 104a+b and 106b, respectively

means for near-end talk and/or double-talk detection, integrated in the multi-channel
acoustic echo cancellation unit 108



200a

block diagram showing a conventional noise reduction and speech activity recognition
system for a telephony-based application based on an audio speech activity estimation
according to the state of the art, wherein the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) is correlated with an audio speech activity
estimate which is obtained by an amplitude detection of the digital audio signal s(nT)

block diagram showing an example of a slow camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to one embodiment of the
present invention, wherein the discrete signal spectrum S_τ(k·Δf) of a delayed version
s(nT−τ) of the analog-to-digital-converted audio signal s(nT) is correlated (S8a) with an
audio speech activity estimate obtained by an amplitude detection (S8b) of the
band-pass-filtered discrete signal spectrum S(k·Δf), thereby yielding an estimate S̃_i(f) for the
frequency spectrum S_i(f) corresponding to the signal s_i(t) which represents said speaker's
voice and an estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the
statistically distributed background noise n'(t), and also correlated (S9) with a visual
speech activity estimate taken from a visual feature vector o_v,t supplied by the visual
feature extraction and analyzing means 104a+b and/or 104'+104", thereby yielding a
further estimate S̃_i'(f) for updating the estimate S̃_i(f) for the frequency spectrum
S_i(f) corresponding to the signal s_i(t) which represents said speaker's voice as well as a
further estimate Φ̃_nn'(f) for updating the estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)



200c

block diagram showing an example of a fast camera-enhanced noise reduction and
speech activity recognition system for a telephony-based application implementing an
audio-visual speech activity estimation algorithm according to a further embodiment of
the present invention, wherein the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t) is correlated (S11b) with an audio-visual speech
activity estimate, obtained by combining an audio feature vector o_a,t which is supplied by
said audio feature extraction and analyzing means 106b with a visual feature vector o_v,t
supplied by the visual speech activity detection module 104", thereby yielding an estimate
S̃_i(f) for the corresponding frequency spectrum S_i(f) of the signal s_i(t) which represents
said speaker's voice as well as an estimate Φ̃_nn(f) for the noise power density
spectrum Φ_nn(f) of the statistically distributed background noise n'(t)



202

delay element, providing a delayed version of the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(t)

204

band-pass filter with adjustable cut-off frequencies which can be adjusted dependent on
the bandwidth of the estimated speech signal spectrum S̃_i(f), used for filtering the
discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)

300a

flow chart illustrating a near-end speaker detection method reducing the noise level of a
detected analog audio sequence s(t) according to the embodiment depicted in Fig. 1 of
the present invention


300b

flow chart illustrating a near-end speaker detection method according to the embodiment
depicted in Fig. 2b of the present invention

300c

flow chart illustrating a near-end speaker detection method according to the embodiment
depicted in Fig. 2c of the present invention


sw 


means for switching said microphone 101a off when the actual level of the speech
activity indication signal ŝ_i(nT) falls below a predefined threshold value (not shown)


S1

step #1: subjecting the analog audio sequence s(t) to an analog-to-digital conversion

S2

step #2: calculating the corresponding discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio sequence s(nT) by performing a Fast Fourier Transform (FFT)


S3

step #3: detecting the voice of said speaker S_i from said signal spectrum S(k·Δf) by
analyzing visual features extracted from a video sequence recorded simultaneously with
the audio sequence s(t), for tracking the current location of the speaker's face, lip
movements and/or facial expressions of the speaker S_i in subsequent images

S4

step #4: estimating the noise power density spectrum Φ_nn(f) of the statistically distributed
background noise n'(t) based on the result of the speaker detection step S3


S5

step #5: subtracting a discretized version Φ̃_nn(k·Δf) of the estimated noise power density
spectrum Φ̃_nn(f) from the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted
audio sequence s(nT)

S6

step #6: calculating the corresponding discrete time-domain signal ŝ_i(nT) of the obtained
difference signal by performing an Inverse Fast Fourier Transform (IFFT)


S7

step #7: conducting a multi-channel acoustic echo cancellation algorithm which models
echo path impulse responses by means of adaptive finite impulse response (FIR) filters
and subtracts echo signals from the analog audio sequence s(t) based on acoustic-phonetic
speech characteristics derived by an algorithm for extracting visual features
from a video sequence tracking the location of a speaker's face, lip movements and/or
facial expressions of the speaker S_i in subsequent images


S8o

step #8o: band-pass-filtering the discrete signal spectrum S(k·Δf) of the
analog-to-digital-converted audio signal s(nT)

S8a

step #8a: correlating the discrete signal spectrum S_τ(k·Δf) of a delayed version s(nT−τ) of
the analog-to-digital-converted audio signal s(nT) with an audio speech activity estimate
obtained by the amplitude detection step S8b

S8b

step #8b: amplitude detection of the band-pass-filtered discrete signal spectrum S(k·Δf),
thereby yielding an estimate S̃_i(f) for the frequency spectrum S_i(f) corresponding to the
signal s_i(t) which represents said speaker's voice as well as an estimate Φ̃_nn(f) for the
noise power density spectrum Φ_nn(f)



WO 2004/066273    PCT/EP2004/000104

S9

step #9: correlating the discrete signal spectrum S_τ(k·Δf) of a delayed version s(nT−τ) of
the analog-to-digital-converted audio signal s(nT) with a visual speech activity estimate
taken from a visual feature vector o_v,t supplied by the visual feature extraction and
analyzing means 104a+b and/or 104'+104", thereby yielding a further estimate S̃_i'(f) for
updating the estimate S̃_i(f) for the frequency spectrum S_i(f) corresponding to the signal
s_i(t) which represents said speaker's voice as well as a further estimate Φ̃_nn'(f) for
updating the estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the
statistically distributed background noise n'(t)


S10

step #10: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering the
discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum S̃_i(f)

S11a

step #11a: adding an audio speech activity estimate obtained by an amplitude detection
of the band-pass-filtered discrete signal spectrum S(k·Δf) of the analog-to-digital-converted
audio signal s(t) to a visual speech activity estimate taken from a visual feature
vector o_v,t supplied by said visual feature extraction and analyzing means 104a+b and/or
104'+104", thereby yielding an audio-visual speech activity estimate

S11b

step #11b: correlating the discrete signal spectrum S(k·Δf) with the audio-visual speech
activity estimate, thus yielding an estimate S̃_i(f) for the frequency spectrum S_i(f)
corresponding to the signal s_i(t) which represents said speaker's voice as well as an
estimate Φ̃_nn(f) for the noise power density spectrum Φ_nn(f) of the statistically distributed
background noise n'(t)

S11c

step #11c: adjusting the cut-off frequencies of a band-pass filter 204 used for filtering
the discrete signal spectrum S(k·Δf) of the analog-to-digital-converted audio signal s(t)
dependent on the bandwidth of the estimated speech signal spectrum S̃_i(f)
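
The chain of steps S2, S4, S5 and S6 listed in the table can be sketched as follows. This is only an illustration under stated assumptions, not the claimed circuit: a plain discrete Fourier transform stands in for the FFT/IFFT, the noise estimate is a recursively averaged magnitude spectrum rather than a power density spectrum, the subtraction is done on magnitudes while keeping the phase of the noisy signal, and the names dft, idft, update_noise_estimate, spectral_subtract and the smoothing factor alpha are hypothetical.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Step S2 (sketch): discrete spectrum of one frame; O(n^2) stand-in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Step S6 (sketch): back to the time domain; stand-in for an IFFT."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def update_noise_estimate(noise_mag: List[float], frame_mag: List[float],
                          speech_active: bool, alpha: float = 0.9) -> List[float]:
    """Step S4 (sketch): update the noise estimate only while no speech is detected.

    speech_active would come from the speaker detection step S3; alpha is an
    assumed smoothing factor.
    """
    if speech_active:
        return list(noise_mag)
    return [alpha * n + (1.0 - alpha) * m for n, m in zip(noise_mag, frame_mag)]


def spectral_subtract(frame: List[float], noise_mag: List[float]) -> List[float]:
    """Steps S2, S5, S6 (sketch): subtract the noise estimate bin by bin.

    Magnitudes are floored at zero and the phase of the noisy spectrum is kept.
    """
    cleaned = []
    for value, noise in zip(dft(frame), noise_mag):
        magnitude = max(abs(value) - noise, 0.0)
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)


if __name__ == "__main__":
    frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # hypothetical samples s(nT)
    noise = [0.5] * len(frame)                           # hypothetical noise estimate
    print(spectral_subtract(frame, noise))
```

Gating the noise update with the speech activity decision is the point at which the visual estimate helps: frames in which the lips are moving are excluded from the noise estimate even if the audio amplitude alone would be ambiguous.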



